TensorFlow Debugger (Part 2)

Original Google TensorFlow 2021-07-27

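Debugging TensorFlow Estimators

This section explains how to debug TensorFlow programs that use the Estimator API. Part of the convenience of these APIs is that they manage Sessions internally, which makes the LocalCLIDebugWrapperSession described in the preceding sections inapplicable. Fortunately, you can still debug them by using the special hooks that tfdbg provides.

tfdbg can debug the train(), evaluate() and predict() methods of tf-learn Estimators. To debug Estimator.train(), create a LocalCLIDebugHook and pass it as part of the hooks argument. For example: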

# First, let your BUILD target depend on "//tensorflow/python/debug:debug_py"
# (You don't need to worry about the BUILD dependency if you are using a pip
# install of open-source TensorFlow.)
from tensorflow.python import debug as tf_debug

# Create a LocalCLIDebugHook and pass it as a hook when calling train().
hooks = [tf_debug.LocalCLIDebugHook()]

# To debug `train`:
classifier.train(input_fn,
                 steps=1000,
                 hooks=hooks)

Similarly, to debug Estimator.evaluate() and Estimator.predict(), pass the hooks through the hooks parameter, as in the following example:

# To debug `evaluate`:
accuracy_score = classifier.evaluate(eval_input_fn,
                                     hooks=hooks)["accuracy"]

# To debug `predict`:
predict_results = classifier.predict(predict_input_fn, hooks=hooks)

debug_tflearn_iris.py contains a complete example of how to use tfdbg with Estimators. To run this example, execute:

python -m tensorflow.python.debug.examples.debug_tflearn_iris --debug

LocalCLIDebugHook also lets you configure a watch_fn, which can be used to flexibly specify which Tensors to watch on different Session.run() calls, as a function of the fetches, the feed_dict and other states. For details, see this API doc (https://tensorflow.google.cn/api_docs/python/tfdbg/DumpingDebugWrapperSession?hl=zh-CN#__init__).

Debugging Keras Models with TFDBG

To use TFDBG with tf.keras, let the Keras backend use a TFDBG-wrapped Session object. For example, to use the CLI wrapper, run the following code:

import tensorflow as tf
from tensorflow.python import debug as tf_debug

tf.keras.backend.set_session(tf_debug.LocalCLIDebugWrapperSession(tf.Session()))

# Define your keras model, called "model".

# Calls to `fit()`, `evaluate()` and `predict()` methods will break into the
# TFDBG CLI.
model.fit(...)
model.evaluate(...)
model.predict(...)

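With a minor modification, the preceding code example also works for the non-TensorFlow version of Keras running against a TensorFlow backend. You only need to replace tf.keras.backend with keras.backend.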

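Debugging tf-slim with TFDBG

TFDBG supports debugging of both training and evaluation with tf-slim. As described below, training and evaluation require slightly different debugging workflows.

Debugging training in tf-slim

To debug the training process, provide LocalCLIDebugWrapperSession to the session_wrapper argument of slim.learning.train(). For example: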

import tensorflow as tf
from tensorflow.python import debug as tf_debug

# ... Code that creates the graph and the train_op ...
tf.contrib.slim.learning.train(
    train_op,
    logdir,
    number_of_steps=10,
    session_wrapper=tf_debug.LocalCLIDebugWrapperSession)

Debugging evaluation in tf-slim

To debug the evaluation process, provide LocalCLIDebugHook to the hooks argument of slim.evaluation.evaluate_once(). For example:

import tensorflow as tf
from tensorflow.python import debug as tf_debug

# ... Code that creates the graph and the eval and final ops ...
tf.contrib.slim.evaluation.evaluate_once(
    '',
    checkpoint_path,
    logdir,
    eval_op=my_eval_op,
    final_op=my_value_op,
    hooks=[tf_debug.LocalCLIDebugHook()])

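Offline Debugging of Remotely-Running Sessions

Often, your model runs on a remote machine or in a process that you have no terminal access to. To debug a model in such cases, you can use the offline_analyzer binary of tfdbg (described below). It operates on dumped data directories. This works with both the lower-level Session API and the higher-level Estimator API.

Debugging Remote tf.Sessions

If you interact directly with the tf.Session API in Python, you can use the tfdbg.watch_graph method to configure the RunOptions proto that you pass to your Session.run() call. The intermediate tensors and runtime graphs are then dumped to a shared storage location of your choice whenever the Session.run() call occurs (at the cost of slower performance). For example: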

from tensorflow.python import debug as tf_debug

# ... Code where your session and graph are set up...

run_options = tf.RunOptions()
tf_debug.watch_graph(
    run_options,
    session.graph,
    debug_urls=["file:///shared/storage/location/tfdbg_dumps_1"])
# Be sure to specify different directories for different run() calls.

session.run(fetches, feed_dict=feeds, options=run_options)
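
The comment in the example above asks for a different dump directory for each run() call. One possible way to arrange that is sketched below. The helper name dump_url_generator and the numbering scheme are hypothetical and not part of tfdbg; the helper only builds file:// URL strings that could be passed as debug_urls.

```python
import itertools
import posixpath


def dump_url_generator(dump_root, prefix="tfdbg_dumps_"):
    """Yields a distinct file:// debug URL for each Session.run() call.

    Hypothetical helper: it only formats strings and does not touch tfdbg.
    """
    for run_index in itertools.count(1):
        dump_dir = posixpath.join(dump_root, "%s%d" % (prefix, run_index))
        yield "file://" + dump_dir


urls = dump_url_generator("/shared/storage/location")
first_url = next(urls)   # URL for the first run() call
second_url = next(urls)  # URL for the second run() call
print(first_url)
print(second_url)
```

Each generated URL would be used for exactly one watch_graph() and run() pair, so that the dumps of separate calls do not get mixed in a single directory.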

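Later, in an environment that you do have terminal access to (for example, a local computer that can reach the shared storage location specified in the code above), you can load and inspect the data in the dump directory by using the offline_analyzer binary of tfdbg. For example: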

python -m tensorflow.python.debug.cli.offline_analyzer \
    --dump_dir=/shared/storage/location/tfdbg_dumps_1

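The Session wrapper DumpingDebugWrapperSession offers an easier and more flexible way to generate file-system dumps that can be analyzed offline. To use it, simply wrap your session in a tf_debug.DumpingDebugWrapperSession. For example: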

# Let your BUILD target depend on "//tensorflow/python/debug:debug_py"
# (You don't need to worry about the BUILD dependency if you are using a pip
# install of open-source TensorFlow.)
from tensorflow.python import debug as tf_debug

sess = tf_debug.DumpingDebugWrapperSession(
    sess, "/shared/storage/location/tfdbg_dumps_1/", watch_fn=my_watch_fn)

The watch_fn argument accepts a Callable that lets you configure which tensors to watch on different Session.run() calls, as a function of the fetches and feed_dict passed to the run() call and of other states.
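
To make the idea of a watch_fn more concrete, here is a minimal sketch. It assumes that a watch_fn is called as watch_fn(fetches, feeds) and returns a watch-options object, and that tf_debug.WatchOptions accepts the keyword arguments debug_ops and node_name_regex_whitelist (these names may differ between TensorFlow releases). The names make_watch_fn and FakeWatchOptions, the "train_op" convention and the rule for choosing the regex are all hypothetical. FakeWatchOptions is only a stand-in so that the sketch runs without TensorFlow.

```python
import collections

# Stand-in for tf_debug.WatchOptions (assumption: same keyword names).
FakeWatchOptions = collections.namedtuple(
    "FakeWatchOptions", ["debug_ops", "node_name_regex_whitelist"])


def _fetch_names(fetches):
    """Turns a single fetch, or a list/tuple of fetches, into name strings."""
    if not isinstance(fetches, (list, tuple)):
        fetches = [fetches]
    return [getattr(fetch, "name", str(fetch)) for fetch in fetches]


def make_watch_fn(watch_options_cls, train_op_name="train_op"):
    """Builds a watch_fn that watches fewer nodes on training steps."""

    def my_watch_fn(fetches, feeds):
        del feeds  # Not used by this sketch.
        is_training_run = any(
            train_op_name in name for name in _fetch_names(fetches))
        # Hypothetical rule: training steps watch only "hidden" nodes,
        # every other run() call watches all nodes.
        regex = r".*hidden.*" if is_training_run else r".*"
        return watch_options_cls(
            debug_ops=["DebugIdentity"],
            node_name_regex_whitelist=regex)

    return my_watch_fn


my_watch_fn = make_watch_fn(FakeWatchOptions)
print(my_watch_fn(["train_op"], None))
print(my_watch_fn("accuracy:0", None))
```

In a real program you would presumably pass tf_debug.WatchOptions instead of FakeWatchOptions and hand the resulting function to the wrapper's watch_fn argument.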

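C++ and other languages

If your model code is written in C++ or another language, you can also modify the debug_options field of RunOptions to generate debug dumps that can be inspected offline. For details, see the proto definition (https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/core/protobuf/debug.proto).

Debugging Remotely-Running Estimators

If you run your Estimator on a remote TensorFlow server, you can use the non-interactive DumpingDebugHook. For example: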

# Let your BUILD target depend on "//tensorflow/python/debug:debug_py"
# (You don't need to worry about the BUILD dependency if you are using a pip
# install of open-source TensorFlow.)
from tensorflow.python import debug as tf_debug

hooks = [tf_debug.DumpingDebugHook("/shared/storage/location/tfdbg_dumps_1")]

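This hook can then be used in the same way as the LocalCLIDebugHook examples described earlier in this document. While the Estimator is being trained, evaluated or used for prediction, tfdbg creates directories with the following name pattern:

/shared/storage/location/tfdbg_dumps_1/run_<epoch_timestamp_microsec>_<uuid>

Each directory corresponds to one Session.run() call that underlies the train(), evaluate() or predict() call. You can load these directories and inspect them offline in a command-line interface by using the offline_analyzer provided by tfdbg. For example: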

python -m tensorflow.python.debug.cli.offline_analyzer \
    --dump_dir="/shared/storage/location/tfdbg_dumps_1/run_<epoch_timestamp_microsec>_<uuid>"
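
Because one directory is created per Session.run() call, a dump root can accumulate many run_* directories. The following sketch lists them oldest first and builds the corresponding offline_analyzer command line. The function names list_run_dirs and analyzer_command are hypothetical helpers, not part of tfdbg, and the regular expression only assumes the run_<epoch_timestamp_microsec>_<uuid> naming shown above.

```python
import os
import re

# Directory names of the form run_<epoch_timestamp_microsec>_<uuid>.
_RUN_DIR_PATTERN = re.compile(r"^run_(\d+)_(.+)$")


def list_run_dirs(dump_root):
    """Returns (timestamp_usec, path) pairs for run_* directories, oldest first."""
    runs = []
    for name in os.listdir(dump_root):
        match = _RUN_DIR_PATTERN.match(name)
        path = os.path.join(dump_root, name)
        if match and os.path.isdir(path):
            runs.append((int(match.group(1)), path))
    return sorted(runs)


def analyzer_command(run_dir):
    """Builds the offline_analyzer command line for one dump directory."""
    return ["python", "-m", "tensorflow.python.debug.cli.offline_analyzer",
            "--dump_dir=" + run_dir]
```

For example, analyzer_command(list_run_dirs(root)[-1][1]) would give the command for the most recent run() call under root.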

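Frequently Asked Questions

Q: Do the timestamps on the left side of the lt output reflect actual performance in a non-debugging session?

A: No. The debugger inserts additional special-purpose debug nodes into the graph to record the values of intermediate tensors. These nodes slow down graph execution. If you are interested in profiling your model, check out:

The profiling mode of tfdbg: tfdbg> run -p
tfprof and other profiling tools for TensorFlow

Q: How do I link tfdbg against my Session in Bazel? Why do I see an error such as "ImportError: cannot import name debug"?

A: In your BUILD rule, declare the dependencies "//tensorflow:tensorflow_py" and "//tensorflow/python/debug:debug_py". The first is the dependency you include to use TensorFlow even without debugger support; the second enables the debugger. Then, in your Python file, add: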

from tensorflow.python import debug as tf_debug

# Then wrap your TensorFlow Session with the local-CLI wrapper.
sess = tf_debug.LocalCLIDebugWrapperSession(sess)

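Q: Does tfdbg help debug runtime errors such as shape mismatches?

A: Yes. tfdbg intercepts errors generated by ops at runtime and presents them to the user in the CLI, together with some debugging instructions. See the examples below: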

# Debugging shape mismatch during matrix multiplication.
python -m tensorflow.python.debug.examples.debug_errors \
    --error shape_mismatch --debug

# Debugging uninitialized variable.
python -m tensorflow.python.debug.examples.debug_errors \
    --error uninitialized_variable --debug

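Q: How can I let my tfdbg-wrapped Sessions or hooks run the debug mode only from the main thread?

A: This is a common use case, in which the Session object is used from multiple threads concurrently. Typically, the child threads take care of background tasks such as running enqueue operations. Often, you want to debug only the main thread (or, less frequently, only one of the child threads). You can use the thread_name_filter keyword argument of LocalCLIDebugWrapperSession to achieve this kind of thread-selective debugging. For example, to debug from the main thread only, construct a wrapped Session as follows: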

sess = tf_debug.LocalCLIDebugWrapperSession(sess, thread_name_filter="MainThread$")

The example above relies on the fact that the main thread in Python has the default name MainThread.
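
The effect of the "MainThread$" pattern can be tried out in plain Python. The sketch below assumes that the filter is applied as a regular-expression match against the name of the calling thread; the function is_debugged is a hypothetical illustration and not a tfdbg API.

```python
import re
import threading

# Same pattern as in the thread_name_filter example above.
thread_name_filter = re.compile("MainThread$")


def is_debugged(thread_name):
    """True if a thread with this name would pass the filter."""
    return thread_name_filter.match(thread_name) is not None


# Python's main thread keeps the name "MainThread" unless it is renamed.
print(threading.main_thread().name)
print(is_debugged("MainThread"))         # passes the filter
print(is_debugged("Thread-1"))           # a typical child thread name
print(is_debugged("MainThread-helper"))  # rejected because of the trailing $
```

The trailing $ matters: without it, any thread whose name merely starts with MainThread would also be debugged.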

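Q: The model I am debugging is very large. The data dumped by tfdbg fills up the free space of my disk. What can I do?

A: You might run into this problem in any of the following situations:

The model has many intermediate tensors
The intermediate tensors are very large
There are many tf.while_loop iterations

There are three possible solutions:

The constructors of LocalCLIDebugWrapperSession and LocalCLIDebugHook provide a keyword argument, dump_root, that specifies the path to which tfdbg dumps the debug data. You can use it to let tfdbg dump the debug data to a disk with more free space. For example: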

# For LocalCLIDebugWrapperSession
sess = tf_debug.LocalCLIDebugWrapperSession(sess, dump_root="/with/lots/of/space")

# For LocalCLIDebugHook
hooks = [tf_debug.LocalCLIDebugHook(dump_root="/with/lots/of/space")]

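Make sure that the directory pointed to by dump_root is empty or does not exist. tfdbg cleans up the dump directories before exiting.

Reduce the batch size used during the runs.
Use the filtering options of tfdbg's run command to watch only specific nodes in the graph. For example: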

tfdbg> run --node_name_filter .*hidden.*
tfdbg> run --op_type_filter Variable.*
tfdbg> run --tensor_dtype_filter int.*

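The first command above watches only nodes whose name matches the regular-expression pattern .*hidden.*. The second command watches only operations whose op type matches the pattern Variable.*. The third command watches only tensors whose dtype matches the pattern int.* (for example, int32).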
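
To get a feeling for what the three filters select, the sketch below applies the same patterns to a small, made-up list of nodes. The node list and the function select are hypothetical; the sketch assumes that each pattern is matched from the start of the string and that several filters combine with logical AND.

```python
import re

# Hypothetical (node name, op type, dtype) triples standing in for a graph.
NODES = [
    ("hidden1/weights", "VariableV2", "float32"),
    ("hidden1/MatMul", "MatMul", "float32"),
    ("global_step", "VariableV2", "int64"),
    ("softmax/logits", "Add", "float32"),
]


def select(nodes, node_name_filter=None, op_type_filter=None,
           tensor_dtype_filter=None):
    """Returns the names of the nodes that pass every filter that is set."""
    selected = []
    for name, op_type, dtype in nodes:
        if node_name_filter and not re.match(node_name_filter, name):
            continue
        if op_type_filter and not re.match(op_type_filter, op_type):
            continue
        if tensor_dtype_filter and not re.match(tensor_dtype_filter, dtype):
            continue
        selected.append(name)
    return selected


print(select(NODES, node_name_filter=r".*hidden.*"))
print(select(NODES, op_type_filter=r"Variable.*"))
print(select(NODES, tensor_dtype_filter=r"int.*"))
```

Trying patterns this way against your own node names before a long debug run can help keep the dump size under control.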

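Q: Why can't I select text in the tfdbg CLI?

A: This is because the tfdbg CLI enables mouse events in the terminal by default. This mouse-mask mode overrides the default terminal interactions, including text selection. You can re-enable text selection with the command mouse off or m off.

Note: mouse-mask link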

https://linux.die.net/man/3/mousemask

Q: Why does the tfdbg CLI show no dumped tensors when I debug code like the following?

a = tf.ones([10], name="a")
b = tf.add(a, a, name="b")
sess = tf.Session()
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.run(b)

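A: You see no dumped data because every node in the executed TensorFlow graph has been constant-folded by the TensorFlow runtime. In this example, a is a constant tensor; therefore, the fetched tensor b is effectively a constant tensor as well. TensorFlow's graph optimization folds the graph that contains a and b into a single node to speed up future runs of the graph, so tfdbg does not generate any intermediate tensor dumps. However, if a were a tf.Variable, as in the following example: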

import numpy as np

a = tf.Variable(np.ones([10]), name="a")
b = tf.add(a, a, name="b")
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.run(b)

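then constant folding would not occur, and tfdbg should show the intermediate tensor dumps.

Q: I am debugging a model that generates unwanted infinities or NaNs. However, some nodes in my model are known to generate infinities or NaNs in their output tensors even under completely normal conditions. How can I skip those nodes during my run -f has_inf_or_nan actions?

A: Use the --filter_exclude_node_names flag (-fenn for short). For example, if you know that a node whose name matches the regular expression .*Sqrt.* generates infinities or NaNs regardless of whether the model is behaving correctly, you can exclude it from the infinity/NaN-finding runs with the command run -f has_inf_or_nan -fenn .*Sqrt.*.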
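
The interplay between the has_inf_or_nan filter and the exclusion pattern can be illustrated in plain Python. This is only a sketch under stated assumptions: the dump data is made up, the function names are hypothetical stand-ins rather than tfdbg's implementation, and the exclusion pattern is assumed to be matched from the start of the node name.

```python
import math
import re


def has_inf_or_nan(values):
    """True if any value in the list is infinite or NaN."""
    return any(math.isinf(v) or math.isnan(v) for v in values)


def find_bad_nodes(dumps, exclude_node_names=None):
    """Returns names of nodes with inf/NaN output, skipping excluded nodes."""
    exclude = re.compile(exclude_node_names) if exclude_node_names else None
    bad = []
    for node_name, values in dumps:
        if exclude and exclude.match(node_name):
            continue  # Known-noisy node: ignore it.
        if has_inf_or_nan(values):
            bad.append(node_name)
    return bad


# Hypothetical dumped tensors as (node name, flat list of values).
DUMPS = [
    ("layer1/Sqrt", [float("nan"), 1.0]),
    ("layer2/Log", [float("-inf"), 0.5]),
    ("layer3/Relu", [0.0, 2.0]),
]

print(find_bad_nodes(DUMPS))
print(find_bad_nodes(DUMPS, exclude_node_names=r".*Sqrt.*"))
```

With the exclusion in place, only layer2/Log is reported, which mirrors how -fenn lets the filter stop at the first unexpected infinity or NaN rather than at a node that is known to produce them.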

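Q: Is there a GUI for tfdbg?

A: Yes, the TensorBoard Debugger Plugin is the GUI of tfdbg. It offers features such as inspection of the computation graph, real-time visualization of tensor values, continuation to tensors and conditional breakpoints, and tying tensors to their graph-construction source code, all in the browser environment. To get started, visit its README (https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md).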
